Return-Path: <news@columbia.edu>
Received: from newsmaster.cc.columbia.edu (newsmaster.cc.columbia.edu [128.59.59.30])
    by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA27636
    for <kermit.misc@watsun.cc.columbia.edu>; Sun, 23 Jan 2000 19:57:18 -0500 (EST)
Received: (from news@localhost)
    by newsmaster.cc.columbia.edu (8.8.5/8.8.5) id TAA19711
    for kermit.misc@watsun.cc.columbia.edu; Sun, 23 Jan 2000 19:28:34 -0500 (EST)
X-Authentication-Warning: newsmaster.cc.columbia.edu: news set sender to <news> using -f
From: fdc@watsun.cc.columbia.edu (Frank da Cruz)
Subject: Case Study #14: Character Sets
Date: 24 Jan 2000 00:28:33 GMT
Organization: Columbia University
Message-ID: <86g6bh$j7s$1@newsmaster.cc.columbia.edu>
To: kermit.misc@columbia.edu

Some recent questions about character-sets prompt today's discussion. As you probably know, Kermit software is practically (and perhaps actually) unique among communication software packages in its ability to convert the character sets of text files while transferring them between platforms that use different ones. In the recent postings, the need was to transfer Portuguese text between a PC that used PC Code Page 850 (CP850) and a UNIX system that used some other encoding. Kermit protocol and software have been able to handle such tasks since the 1980s.

This feature is important to everybody who reads and writes a language that uses accented and/or non-Roman characters -- in other words, the overwhelming majority of humanity. Only a few languages are written entirely in plain ABCs: English, Latin, Malay, and maybe Dutch. Nearly all the others need accents or non-ABC characters.

But accented and non-Roman characters are represented differently on different computers. So (returning to our example) if you copy Portuguese text from (say) DOS or Windows to (say) HP-UX or VMS, all the accented letters become, well, garbage. If you copy Greek, Russian, or Hebrew text between the same two computers, ALL the letters become garbage.
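
The "garbage" effect is easy to reproduce outside of Kermit. The following is a minimal Python sketch, not anything Kermit does internally: it assumes Python's built-in cp850, latin_1, and hp_roman8 codecs are reasonable stand-ins for the encodings on the two computers, and the sample sentence is made up for illustration.

```python
# Illustration only: Python codecs stand in for the two computers' encodings.
# The sample text is hypothetical.
text = "São Paulo, coração"

# Bytes as they would be stored on a PC that uses Code Page 850.
pc_bytes = text.encode("cp850")

# Copy the bytes unchanged, then read them under two other encodings.
as_latin1 = pc_bytes.decode("latin_1")
as_roman8 = pc_bytes.decode("hp_roman8", errors="replace")

print("CP850 bytes read as CP850    :", pc_bytes.decode("cp850"))
print("CP850 bytes read as Latin-1  :", repr(as_latin1))
print("CP850 bytes read as HP-Roman8:", repr(as_roman8))
```

The plain ASCII letters survive in every case; only the accented letters change, which is why the damage is partial for Portuguese and total for text written in a non-Roman alphabet.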

What good is accomplished by moving text from one computer to another if the result is gibberish? In the world at large, text-file transfer should provide for character-set conversion. The Kermit protocol does; the method was worked out in the late 1980s and is written up in papers you can find at:

  http://www.columbia.edu/kermit/papers.html

Suppose you want to send Portuguese text from DOS to HP-UX (in DOS, Portuguese text can be encoded in CP437, CP850, or CP860, each of them different; your first job is to find out which one is actually used on your PC). Let's say the encoding is CP850. You would tell Kermit on the PC to:

  set file character-set cp850

Use C-Kermit's menu-on-demand feature to find out what file character-sets are available:

  set file character-set ?

This gives you the complete list.

But PC Kermit doesn't send CP850 on the wire, because it's a private (proprietary) character set. Only standard character sets should be used between computers. Kermit supports a small number of standard transfer character-sets, each one covering its own group of languages (and therefore file character-sets). You have to tell it which one to use; in this case, ISO 8859-1 Latin Alphabet 1:

  set transfer character-set latin1

You can see the list of available transfer character-sets with:

  set transfer character-set ?

(If you obtained the two lists, you should have seen about 50 file character-sets and 10 transfer character-sets, enough to cover the West and East European Roman-alphabet languages, plus languages written in Cyrillic and Hebrew, plus Greek and Japanese.)

Now when PC Kermit sends a file in text mode, it converts the file from CP850 to Latin-1, and announces the Latin-1 encoding to the receiving Kermit program.
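
The sender-side step just described (file character-set to transfer character-set) can be sketched in a few lines of Python. This is only an analogy under stated assumptions: the function name to_transfer_encoding is hypothetical, Python's cp850 and latin_1 codecs stand in for Kermit's CP850 and Latin-1, and nothing here reflects how Kermit is actually coded.

```python
def to_transfer_encoding(file_bytes: bytes,
                         file_charset: str = "cp850",
                         transfer_charset: str = "latin_1") -> bytes:
    """Re-encode bytes from the local file character-set into the
    standard transfer character-set (hypothetical helper)."""
    return file_bytes.decode(file_charset).encode(transfer_charset)


# A short Portuguese word as it would be stored on the CP850 PC.
local = "ação".encode("cp850")

# What would travel "on the wire" after conversion to Latin-1.
wire = to_transfer_encoding(local)

print(local.hex(" "))  # byte values of the CP850 file
print(wire.hex(" "))   # byte values of the same text in Latin-1
```

The two byte sequences differ for the accented letters even though they represent the same text, which is the whole reason the conversion step exists.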

Meanwhile, because the HP-Roman8 character set is used on HP-UX, which is different not only from the PC code pages just mentioned but also from Latin-1, HP-UX C-Kermit must be told to:

  set file character-set hp-roman8

The final step is to make sure the file sender transfers the file in text mode, rather than binary mode, because character-set and record-format conversions take place only in text mode:

  set file type text

Now the file can be transferred. To summarize, the following commands are given to the file sender:

  set file character-set cp850          ; Identify the source file encoding
  set transfer character-set latin1     ; Specify the transfer encoding
  set file type text                    ; Choose text mode
  send quilombo.txt                     ; Send a file

and to the file receiver:

  set file character-set hp-roman8      ; Identify target file encoding
  receive                               ; Receive the file

The file sender tells the file receiver to expect a text file encoded in Latin-1; the file sender converts from CP850 to Latin-1, and the file receiver converts from Latin-1 to HP-Roman8.

To send files in the other direction, simply exchange the SEND and RECEIVE commands (keeping the SET FILE TYPE TEXT command with the file sender); the rest stays the same.

This is all old news, but it might still be new to many readers. The procedures and specific character sets are documented in Chapter 16 of "Using C-Kermit", 2nd Edition, and in other Kermit manuals. All of the facilities discussed until now are found in C-Kermit 5A and later, MS-DOS Kermit 3.0 and later, Kermit 95 (all versions), and IBM Mainframe Kermit since (I think) version 4.1.

So what's new in C-Kermit 7.0 and the forthcoming 1.1.18 release of Kermit 95? Lots of new character sets have been added, including many for Eastern Europe and the former Soviet Union, as well as those used for Greek. And Unicode, the new Universal Character Set, which was discussed in a previous posting. So now the possibilities for character-set conversion are wider than ever.
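
Returning to the two-stage transfer summarized earlier in this posting (CP850 to Latin-1 at the sender, Latin-1 to HP-Roman8 at the receiver), here is a rough Python sketch of both stages, including the reverse direction. The function names send_side and receive_side are hypothetical, and Python's cp850, latin_1, and hp_roman8 codecs are assumed to be adequate stand-ins for the Kermit character sets of the same names; this is an analogy, not Kermit's implementation.

```python
def send_side(file_bytes: bytes, file_charset: str, transfer_charset: str) -> bytes:
    """Sender: local file character-set -> transfer character-set."""
    return file_bytes.decode(file_charset).encode(transfer_charset)


def receive_side(wire_bytes: bytes, transfer_charset: str, file_charset: str) -> bytes:
    """Receiver: transfer character-set -> local file character-set."""
    return wire_bytes.decode(transfer_charset).encode(file_charset)


original = "coração"                      # hypothetical file contents

# PC (CP850) to HP-UX (HP-Roman8), with Latin-1 on the wire.
pc_file = original.encode("cp850")
wire = send_side(pc_file, "cp850", "latin_1")
hp_file = receive_side(wire, "latin_1", "hp_roman8")

# Other direction: the roles swap, the transfer character-set stays the same.
wire_back = send_side(hp_file, "hp_roman8", "latin_1")
pc_file_back = receive_side(wire_back, "latin_1", "cp850")

print(hp_file.decode("hp_roman8"))        # text as seen on the HP-UX side
print(pc_file_back == pc_file)            # round trip restores the PC bytes
```

Because each end only needs to know its own file character-set and the shared transfer character-set, neither side has to know anything about the other's private encoding.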

And in keeping with our goal that C-Kermit 7.0 "just work" for most people most of the time, we have also added not just automatic text/binary mode switching, discussed previously, but also automatic character-set associations, in which each file character-set is associated with an appropriate transfer character-set, and vice versa. C-Kermit comes with a comprehensive table of associations preloaded, which you can view with:

  show associations

Perhaps you were wondering (if you don't have a manual) how you were supposed to know that Latin-1 was the appropriate transfer character-set for CP850? Good question! Now this information is built in to C-Kermit. So whenever you pick a file character-set, C-Kermit picks the appropriate transfer character-set for you, and vice versa. Furthermore, whenever C-Kermit receives a text file in a particular transfer character-set, it converts it to the appropriate file character-set automatically, even if you have not told it which one to use.

So the sequence above is now simplified. At the sender:

  set file character-set cp850          ; Identify the source file encoding
  send *.*                              ; Send some files

and at the file receiver:

  receive                               ; Receive the file

Appropriate associations are built in for each platform. So you just have to start the ball rolling by specifying the encoding of the source file; the rest flows from there. And now because of automatic text/binary mode switching, you can send a mixed group of text and binary files and have the character-set conversions applied only to the text files.

Of course you can change associations if you need to. The command is ASSOCIATE. You can also turn this whole feature on and off with SET SEND (and RECEIVE) CHARACTER-SET-SELECTION. For complete details about character-set associations, see Section 6.5 of the ckermit2.txt file.

So now C-Kermit is just about as automatic as it can be in this area. The one thing it can't do is figure out automatically the encoding of a file.
Some people believe this can be done, but I'm not one of them. Operating systems have never tagged files by encoding, and guessing the encoding from inspection is highly unreliable.

By the way, C-Kermit's character-set conversion capabilities are not limited to file transfer. They are also available in terminal (CONNECT) mode. In this case you choose the translation with:

  set terminal character-set <remote-set> [ <local-set> ]

The <local-set> defaults to C-Kermit's current file character-set. Again, type a question mark in the character-set field to get a list of available choices.

Finally, you can also use C-Kermit to convert a local file from one character-set to another. For example, to convert the file oofa.txt from Latin-1 to the UTF-8 form of Unicode, and store the result as oofa.utf8, the command would be:

  translate oofa.txt latin1 utf8 oofa.utf8

This is nothing new, except for the expanded character-set choices.

- Frank